@MAVRICK-1 MAVRICK-1 commented Aug 27, 2025

🎯 Overview

This PR adds a detection rule (CRE-2025-0130) for CUDA out-of-memory failures in Stable Diffusion WebUI, a common and severe failure mode in AUTOMATIC1111 deployments. The rule identifies CUDA memory exhaustion that leads to complete WebUI service failure requiring manual intervention.

CRE Playground Links

CRE-2025-0130 Playground: Test Rule

🚨 Problem Statement

High-Severity Issue: Stable Diffusion WebUI CUDA failures cause:

  • Complete service interruption - WebUI becomes unresponsive and requires manual restart
  • Loss of current image generation progress and any queued generation tasks
  • Potential CUDA context corruption requiring process restart to recover
  • User experience degradation with failed image generations and error messages
  • System instability in multi-user deployments where one user's OOM affects others
  • Cascading failures where recovery attempts also fail due to memory constraints

Why This Matters: Stable Diffusion CUDA failures are particularly dangerous because:

  • High-resolution image generation (1024x1024 and above) requires far more GPU VRAM than lower resolutions
  • Failures often occur mid-generation, so the in-progress image is lost entirely
  • AUTOMATIC1111 WebUI has millions of users globally
  • Issues manifest as generic crashes making diagnosis difficult
  • Memory fragmentation prevents allocation of required contiguous memory blocks
  • Requires immediate intervention to restore service functionality
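To make the first point above concrete, here is a rough back-of-the-envelope sketch of how one memory cost grows with resolution. It is not taken from the PR: it assumes an 8x latent downsampling factor, fp16 activations, and a naive self-attention implementation that materializes the full score matrix. Memory-efficient attention backends avoid this allocation, so treat the numbers as an illustration of the scaling, not a measurement. The function name `attention_matrix_bytes` is hypothetical.

```python
def attention_matrix_bytes(width, height, bytes_per_element=2, downsample=8, heads=1):
    """Estimate the size of a naive self-attention score matrix.

    Assumptions (not from the PR): the image is downsampled by `downsample`
    into a latent grid, every latent position is one token, and the full
    tokens x tokens matrix is materialized in memory.
    """
    tokens = (width // downsample) * (height // downsample)
    return heads * tokens * tokens * bytes_per_element


MIB = 1024 ** 2
for side in (512, 1024):
    size = attention_matrix_bytes(side, side) / MIB
    print(f"{side}x{side}: {size:.0f} MiB per attention head")
# 512x512: 32 MiB per attention head
# 1024x1024: 512 MiB per attention head
```

Doubling each side quadruples the token count and multiplies the score matrix by 16, which is why a request that fits at 512x512 can exhaust VRAM at 1024x1024.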

Rule Performance

  • Detections: 2 critical hits using sequence matching
  • Processing Speed: 64.52K lines/s
  • Window: 30-second detection window captures failure cascade
  • False Positive Rate: Low (specific PyTorch CUDA error patterns)
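The idea of matching a sequence of events inside a 30-second window can be sketched in a few lines of Python. This is only an illustration of the concept, not how preq evaluates rules and not the contents of the rule file. The function `sequence_within_window` and the sample timestamps are hypothetical; the two messages are taken from the issue table in this PR.

```python
from datetime import datetime, timedelta


def sequence_within_window(events, patterns, window=timedelta(seconds=30)):
    """Return True if every pattern appears, in order, within `window`.

    `events` is a time-sorted iterable of (timestamp, message) pairs.
    The window starts at the event matching the first pattern.
    """
    events = list(events)
    for i, (start, message) in enumerate(events):
        if patterns[0] not in message:
            continue
        matched = 1
        for timestamp, later in events[i + 1:]:
            if timestamp - start > window:
                break  # remaining events fall outside the window
            if matched < len(patterns) and patterns[matched] in later:
                matched += 1
        if matched == len(patterns):
            return True
    return False


# Hypothetical sample events for illustration only.
t0 = datetime(2025, 8, 27, 13, 0, 0)
sequence = ["CUDA out of memory", "Recovery failed"]
events = [
    (t0, "torch.cuda.OutOfMemoryError: CUDA out of memory"),
    (t0 + timedelta(seconds=5), "Recovery failed - WebUI requires restart"),
]
print(sequence_within_window(events, sequence))  # True
```

Requiring the follow-up event inside the window is what keeps an isolated, recovered OOM from being reported as a full service failure.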

📊 Stable Diffusion Issues Covered

| # | Issue Type | Example Error Pattern |
|---|------------|-----------------------|
| 1 | CUDA Memory Exhaustion | torch.cuda.OutOfMemoryError: CUDA out of memory |
| 2 | Model Loading Failures | Failed to allocate tensor on device |
| 3 | Generation Process Crashes | Fatal error during image generation |
| 4 | WebUI Unresponsiveness | Gradio interface becoming unresponsive |
| 5 | Recovery Failures | Recovery failed - WebUI requires restart |
| 6 | CUDA Context Corruption | CUDA context may be corrupted |
| 7 | Complete Service Failure | Complete service failure - manual intervention required |
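As a quick way to sanity-check a log file against the patterns listed above, a minimal Python sketch follows. It uses plain substring matching on the example strings from the table. The names `ISSUE_PATTERNS` and `classify` are hypothetical, and the actual rule in `stable-diffusion-cuda-oom.yaml` may match differently.

```python
# Issue types and example patterns copied from the table in this PR.
ISSUE_PATTERNS = {
    "CUDA Memory Exhaustion": "torch.cuda.OutOfMemoryError: CUDA out of memory",
    "Model Loading Failures": "Failed to allocate tensor on device",
    "Generation Process Crashes": "Fatal error during image generation",
    "WebUI Unresponsiveness": "Gradio interface becoming unresponsive",
    "Recovery Failures": "Recovery failed - WebUI requires restart",
    "CUDA Context Corruption": "CUDA context may be corrupted",
    "Complete Service Failure": "Complete service failure - manual intervention required",
}


def classify(line):
    """Return the issue types whose example pattern occurs in `line`."""
    return [issue for issue, pattern in ISSUE_PATTERNS.items() if pattern in line]


print(classify("torch.cuda.OutOfMemoryError: CUDA out of memory"))
# ['CUDA Memory Exhaustion']
```

Lines that match none of the patterns return an empty list, so the function can be applied to every line of a log and the non-empty results counted per issue type.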

🧪 Testing & Validation

CRE Rule Testing

cd stable-diffusion-demo
cat logs/sd-webui-cuda-oom.log | preq -r ../rules/cre-2025-0130/stable-diffusion-cuda-oom.yaml -d

Test Results: see the attached screenshot (Screenshot from 2025-08-27 13-17-40).

🎬 Demo Environment

Repo link (private invitation already sent): https://github.com/MAVRICK-1/cuda-oom

Demo recording: Screencast.from.2025-08-27.13-19-35.mp4
./start-demo.sh
cat logs/roop-cuda-oom.log | preq -r stable-diffusion-cuda-oom.yaml -d

Fixes #130
/claim #130

Successfully merging this pull request may close these issues.

Stable Diffusion Web UI: Reproduce A High-Severity Failure & Write a CRE Rule [Multiple Winners] [Submit by August 31 11:59 pm ET]